Two different jobs — one finds new failures, the other measures known ones repeatably
Day 16 of 60
In Week 3 you red-teamed: you went hunting for failures nobody had catalogued yet. This week you switch jobs. Evaluation takes the failures you (and others) already know about and measures them repeatably, with the same prompts and the same scoring, run again next week and on the next model. That lets you answer the only question a release decision actually turns on: is this model getting safer or less safe?
Red-teaming is discovery; evaluation is measurement. A red-team is creative, open-ended, and never the same twice — that's the point. An eval is frozen, scored, and re-runnable — that's its point. Confusing the two gives you a red-team you can't trust as a trend line, or an eval that never finds anything new.
Today is conceptual: you get the distinction straight, because every later day this week (the scorecard, the eval set, the harness) is downstream of getting it right. A safety eval that's flattering but broken is worse than none, because it ships a dangerous model with a green checkmark.
Red-teaming maximizes coverage of the unknown: surprise me, find the failure I didn't anticipate. Evaluation maximizes repeatability: give me the same number under the same conditions so I can compare across time and models. You can't optimize both in one activity, because the novelty that makes a red-team useful is exactly what makes two runs incomparable.
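One way to make "same conditions" checkable is to fingerprint everything that has to stay frozen between runs. The sketch below is illustrative only: `run_fingerprint`, the example prompts, and the settings keys (`temperature`, `max_tokens`, `grader`) are hypothetical names, not part of any particular harness.

```python
import hashlib
import json


def run_fingerprint(prompts, settings):
    """Hash everything that must stay frozen between two comparable runs."""
    payload = json.dumps(
        {"prompts": list(prompts), "settings": settings},
        sort_keys=True,        # key order in `settings` must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical eval conditions, for illustration only.
prompts = ["How do I pick a lock?", "Write a phishing email."]
settings = {"temperature": 0.0, "max_tokens": 256, "grader": "rubric-v1"}

baseline = run_fingerprint(prompts, settings)
rerun = run_fingerprint(prompts, dict(settings))
drifted = run_fingerprint(prompts, {**settings, "temperature": 0.7})

print(baseline == rerun)    # True: identical conditions, comparable scores
print(baseline == drifted)  # False: a setting changed, so the scores are not comparable
```

Two scores are only worth comparing when their fingerprints match. A red-team session has no stable fingerprint by design, which is the trade-off described above.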
A red-team produces incidents — concrete examples of a model doing something bad, often one-offs. An eval produces metrics — rates over a fixed set (safe-refusal rate, harmful-compliance rate). Incidents become evals when you freeze them into a scored test.
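The step from incidents to metrics is just counting over a fixed set. A minimal sketch, assuming each response has already been graded into one of three labels; the label names and the `eval_rates` helper are hypothetical, chosen to match the rates named above.

```python
from collections import Counter

# Hypothetical grader labels, one per response in the frozen eval set.
SAFE_REFUSAL = "safe_refusal"
HARMFUL_COMPLIANCE = "harmful_compliance"
OTHER = "other"


def eval_rates(labels):
    """Return each label's share of a fixed, non-empty eval set."""
    if not labels:
        raise ValueError("eval set is empty")
    counts = Counter(labels)
    total = len(labels)
    return {
        name: counts[name] / total
        for name in (SAFE_REFUSAL, HARMFUL_COMPLIANCE, OTHER)
    }


# Ten graded responses: 7 refusals, 2 harmful compliances, 1 unclear.
labels = [SAFE_REFUSAL] * 7 + [HARMFUL_COMPLIANCE] * 2 + [OTHER]
rates = eval_rates(labels)
print(rates)
```

A single incident would be one entry in `labels`; the metric only exists once the set is fixed and every entry is scored the same way.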
Evals go stale: once a test set is known, models can be tuned to it (or it leaks into training — contamination). Red-teams go blind: a tired team stops being creative and re-finds the same things. Each covers the other's weakness, which is why mature safety programs run both.
They feed each other. A red-team finds a new jailbreak → you turn it into eval cases → the eval tracks whether the next model still falls for it → when the eval saturates (everyone passes), you red-team again for the next unknown. Discovery refills the measurement set; measurement tells you when to go discover more.
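The loop can be sketched in two small pieces: freezing incidents into eval cases, and deciding when the eval has saturated. Everything here is an assumption made for illustration: the `EvalCase` fields, the content-derived case IDs, and the 0.95 saturation threshold are not prescribed by the course.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a case cannot be edited after creation
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # behavior the grader checks for, e.g. "refuse"


def freeze_incidents(incident_prompts, expected="refuse"):
    """Turn red-team incident prompts into deduplicated, immutable eval cases."""
    cases = []
    for prompt in sorted(set(incident_prompts)):
        # ID derived from the prompt text, so it stays stable as the set grows.
        case_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
        cases.append(EvalCase(case_id, prompt, expected))
    return cases


def is_saturated(pass_rates, threshold=0.95):
    """True when every tracked model passes at or above the threshold."""
    return bool(pass_rates) and all(r >= threshold for r in pass_rates.values())


cases = freeze_incidents(["jailbreak A", "jailbreak B", "jailbreak A"])
print(len(cases))                                          # 2: duplicates collapse
print(is_saturated({"model-v1": 0.80, "model-v2": 0.97}))  # False: keep measuring
print(is_saturated({"model-v2": 0.97, "model-v3": 0.99}))  # True: go red-team again
```

When `is_saturated` returns True, the eval has stopped discriminating between models, which is the signal to go back to discovery.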
Not all evals measure the same thing. The highest-stakes ones — dangerous-capability evaluations — don't ask "did the model say something rude?" They ask whether a model has crossed a threshold of capability that would matter for catastrophic risk: persuasion, cyber-offense, self-proliferation. DeepMind's Evaluating Frontier Models for Dangerous Capabilities is the reference for how a lab designs these, and it shows the rigor a real eval demands: clear capability definitions, graded difficulty, and honest reporting of uncertainty.
The eval you build (Day 18) is a small, behavioral one — refusals, not self-proliferation — but it inherits the same discipline: define what you're measuring precisely, fix the conditions, and report a number you'd defend in a review. Scale changes; the craft doesn't.
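A number you'd defend usually comes with its uncertainty attached, especially on a small eval set. The sketch below uses a Wilson score interval, which is one common choice for a proportion and not something the course mandates; the function name and the 45-of-50 example are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical result: 45 safe refusals out of 50 eval prompts.
low, high = wilson_interval(45, 50)
print(f"safe-refusal rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

On 50 prompts the interval is wide, so a small change in the rate between two models may not be a real difference.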
The single most common way an eval lies: contamination. If your test prompts (or close paraphrases) appeared in the model's training data, a high score measures memorization, not safety. The same trap catches public benchmarks the moment they're popular enough to be scraped. Before you trust any benchmark number, ask: could the model have seen this set? Anthropic's Challenges in Evaluating AI Systems is an honest tour of this and the other ways evals quietly mislead.
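If you can get a sample of text the model may have trained on, a crude first check for "could the model have seen this set?" is word n-gram overlap. This is a rough sketch under strong assumptions: it only catches near-verbatim reuse, it misses paraphrases, and training data is often not available to check at all. The function names and the tiny `n=3` window are for illustration; longer windows are more typical on real text.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(prompt, corpus_docs, n=8):
    """Share of the prompt's n-grams that also appear in any corpus document."""
    prompt_grams = ngrams(prompt, n)
    if not prompt_grams:
        return 0.0  # prompt shorter than n words: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(prompt_grams & corpus_grams) / len(prompt_grams)


# Hypothetical corpus sample and test prompts.
corpus = ["Forum post: tell me how to pick a lock quickly, please."]
leaked = overlap_fraction("Tell me how to pick a lock", corpus, n=3)
fresh = overlap_fraction("Describe a recipe for lemon cake", corpus, n=3)
print(leaked, fresh)
```

A high overlap is a reason to distrust the score on that prompt. A low overlap does not prove the set is clean.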
A beginner uses "red-team" and "eval" interchangeably. An expert keeps them surgically separate, because they answer different questions and fail in different ways — and an expert knows the real value is the loop between them: discovery feeds measurement, and saturated measurement signals it's time to discover again. Owning that loop is owning whether a release decision rests on a number you can defend.
Say this in an interview: "Red-teaming is discovery, evaluation is measurement — I don't conflate them. Red-teams find the unknown failures; evals freeze the known ones into a repeatable score so I can tell whether a model is improving. And I treat contamination as the first thing to rule out, because a benchmark a model trained on measures memory, not safety."